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Abstract 

This  paper  describes  two  different  approaches  to  dealing  with  the  ini¬ 
tial  transient  problem.  In  the  first  approach,  the  length  of  the  “warm-up 
period”  is  determined  by  obtaining  analytical  estimates  on  the  rate  of  con¬ 
vergence  to  stationarity.  Specifically,  we  obtain  an  upper  bound  on  the 
“second  eigenvalue”  of  the  transition  matrix  of  a  Markov  chain,  thereby 
providing  one  with  a  theoretical  device  that  potentially  can  give  estimates 
of  the  desired  form.  The  second  approach  is  data-driven,  and  involves  us¬ 
ing  observed  data  from  the  simulation  to  determine  an  estimate  of  the 
“warm-up  period”.  For  the  method  we  study,  we  are  able  to  use  a  cou¬ 
pling  argument  to  establish  a  number  of  important  theoretical  properties 
of  the  algorithm. 

Key  Words:  Initial  transient,  Markov  processes,  eigenvalues,  coupling,  total 
variation  norm,  rates  of  convergence,  simulation. 

1  Introduction 

In  many  applications  settings,  it  is  of  interest  to  compute  steady-state  perfor¬ 
mance  measures.  To  be  specific,  suppose  that  the  system  under  consideration 
is  described  by  a  Markov  process  X  =  (X(i)  :  t  >  0)  living  on  state  space  5. 
For  a  given  /  :  5  — >1R,  the  steady-state  simulation  problem  is  concerned  with 
the  estimation  of  the  time- average  limit  a  defined  via  the  law  of  large  numbers 

f{X{s))d$  — >  a  Px  a.s. 
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as  f  00  for  all  a:  6  5  (assuming  such  a  limit  exists),  where 

Px(-)  4  P(-|X(0)  =  x). 

Such  laws  of  large  numbers  hold,  in  great  generality,  for  Markov  processes  ex¬ 
hibiting  some  type  of  positive  recurrence  condition.  In  addition,  for  such  pro¬ 
cesses,  there  typically  exists  a  unique  probability  distribution  tt,  known  as  the 
stationary  distribution,  such  that 

Q  =  y  f{x)iT{dx). 

Furthermore,  tt  has  the  property  that  if  X{0)  has  distribution  tt,  then  X  is  a 
(strictly)  stationary  process.  It  follows  that  if  X  is  initiated  with  distribution 
TT,  then  the  sample  mean  a{t)  given  by 

a{t)  4  ^J^fiX{s))ds 

is  unbiased  as  an  estimator  of  q. 

Unfortunately,  the  distribution  tt  is  typically  unknown  and,  consequently,  it 
is  generally  impractical  to  generate  X(0)  from  tt.  As  a  result,  bias  is  induced 
in  a(t)  and  the  initial  segment  of  the  simulation  may  be  unrepresentative  of 
steady-state  behavior.  This  introduces  certain  complications  into  the  estima¬ 
tion  problem  that  do  not  arise  in  Monte  Carlo  environments  in  which  unbiased 
estimators  may  easily  be  constructed.  The  “initial  transient  problem”  focuses 
both  on  the  effect  of  this  initial  bias,  and  on  developing  effective  algorithms  for 
mitigating  the  impact  of  this  bias. 

One  means  of  attacking  this  problem  is  to  note  that  the  positive  recurrent 
Markov  processes  (satisfying  some  sort  of  aperiodicity  condition)  typically  ex¬ 
hibit  “total  variation  convergence”  to  stationarity,  by  which  we  mean  that 

||P(X,  e-)-Px(:^e-)||->o 

as  t  ^  00,  where  AT*  4  ( A:(f  -h  s)  :  s  >  0)  is  the  “post-<”  process,  P^(  )  is 

the  distribution  under  which  X  has  initial  distribution  tt,  and  ||  •  ||  is  the  total 
variation  norm  defined  by 

IIpII  =  sup|7y(A)| 

A 

for  any  signed  measure  77.  Thus,  if  one  can  compute  a  time  s  for  which 

||P(X,  €-)-Px(X€-)l|<c, 

we  have  an  e-guarantee  that  the  post-s  process  is  close  to  stationarity,  and  hence 
any  data  collected  subsequent  to  s  should  have  relatively  low  bias. 
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The  remainder  of  this  paper  describes  two  different  approaches  to  accom¬ 
plishing  this  task.  In  section  2,  we  establish  a  new  analytical  bound  for  the  total 
variation  distance  from  stationarity  that  takes  explicit  advantage  of  the  known 
transition  structure  of  the  system.  By  contrast,  section  3  is  concerned  with 
developing  a  new  method  for  identifying  s  that  is  purely  data-driven,  and  takes 
no  explicit  advantage  of  the  transition  structure  of  the  system  being  simulated. 

2  Upper  Bounds  on  Rates  of  Convergence  to 
Stationarity 

As  indicated  in  section  1,  we  are  concerned  here  with  developing  upper  bounds 
on  the  rate  of  convergence  to  stationarity,  as  described  via  the  total  variation 
norm.  For  the  remainder  of  this  section,  we  shall  assume  that  X  is  an  aperiodic, 
irreducible,  discrete-time  Markov  chain  with  finite  state  space  (although  one 
would  expect  appropriate  analogs  in  both  continuous  time  and  general  state 
space). 

We  start  by  noting  that  for  t  G  Z"^, 

l|P(X,e.)-P.(^€-)ll  =  mx{t)e-)-PAXe^)\\ 

y 

where  P  =  ( Pxy  :  x,y  e  S)  is  the  transition  matrix  of  X.  Let  tt  be  the  unique 
stationary  distribution  of  X,  and  let  11  be  the  matrix  having  all  rows  identical 
to  TT.  Since  PTl  =  UP  =  11^,  it  is  evident  (via  an  inductive  proof)  that  for 
^  >  1, 

-  n  =  (p  -  n)^. 

Clearly,  the  rate  of  convergence  of  P^  to  11  is  therefore  related  to  the  structure 
of  the  eigenvalues  of  P  -  H.  In  particular,  let  Ai,  A2, . . . ,  be  the  distinct 
(complex-valued)  eigenvalues  of  P  -  11,  with  corresponding  (complex- valued) 
eigenvectors  ui,  U2, . .  ■ ,  Note  that 

lim  ilog(||P(Xt  €  ■)-PAX  e  OH)  <  log(7)  (1) 

n— *00  n 

where  7  =  max(  |Ai|  :  1  <  i  <  d).  Thus,  a  bound  on  7  yields  a  bound  on  the 
rate  of  convergence  to  stationarity. 

In  view  of  this,  let  (A,  u)  be  an  eigenvalue-eigenvector  pair  corresponding  to 
P  —  n  so  that 

P^u  —  Uu  = 

for  n  >  1.  In  other  words. 


Ea:?i(An)  -  TTU  =  X^u{x) 


(2) 
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for  X  G  S.  Assume  A  ^  0,  and  set 


Mn  =  X  -  7ru(l  -  A")(l  -  A)-1]  . 

(Clearly,  |A|  <  1  since  (P  -  E)"  ->  0,  and  thus  (1  -  A)-i  is  finite.)  We  claim 
that  ( :  n  >  0 )  is  a  martingale  with  respect  to  =  (T(Aro, ...,Xn). 

To  verify  this,  note  that  since  S  is  finite,  is  integrable  and  adapted  to 
(Pn  :  n  >  0).  Furthermore,  (2)  implies  that 

Ea;(M„+i|P„)  =  A~"~^(Au(Xn)  +  TTu  —  7ru(l  —  A""*‘^)(l  —  A)~^) 

=  A~"(u(A:„)-7rw(l-A”)(l-A)~')  =  M„, 

so  ( Mji  :  n  >  0 )  is  indeed  a  (complex-valued)  martingale. 

Let  T{x)  =  inf{n  >  1  :  Ar„  =  x}  be  the  (first)  hitting  time  of  x.  Set 
Dj  =  Mj  —  and  observe  that  for  n  >  1, 

n 

=  E^Mo  +  '^E^DjI{T{x)>j)  (3) 

n 

=  E,Mo  +  E,/(r(x)  >  j)E,iDj\T,_,) 

j=i 

=  Ex  Mo. 

(Thus,  the  optional  sampling  identity  continues  to  hold  despite  the  fact  that 
(M„  :  n  >  0)  is  not  real  valued).  Note  that  MT^x)An  ^  Mx(x)  a.s.  (since 
T(x)  is  finite  valued).  Furthermore,  if  Ex|A|-^(*)  <  oo,  then  the  Dominated 
Convergence  Theorem,  applied  to  (3),  yields 

ExMt(x)  =  ExMo, 

so  that 


Ex(A  -  7ru(l  -  A^(^>)(1  -  A)“^))  =  u(x). 

But  u{Xt{x))  =  u{x)  and  Ex(l  -  ^  0  (since  >  |A|-i  >  1),  from 

which  it  follows  that  if  ExIAl"^^'^)  <  oo, 

u{x)  =  7ru(l  —  A)~^ 

Thus,  if  Ey|A|  <  00  for  all  y  ^  evidently  we  would  obtain  u(y)  = 
7ru(l  -  A)“^  for  all  y  e  S.  This  is  a  contradiction  (as  is  easily  seen  by  taking  tt 
of  both  sides).  Hence,  there  exists  y  e  S  such  that  Ey|A|“^^2/)  = 

Let  0(y)  =  sup{  |A|  :  Ey|A|~^^^^  <  00}  be  the  radius  of  convergence  of 
the  probability  generating  function  of  T{y).  We  have  just  shown  above  that 
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there  exists  y  ^  S  such  that  (5(y)  <  |A|"^  So  |A|~^  >  mm{P{y)  :  y  e  S)  or, 
equivalently, 

|A|  <  max{P{y)~^  :  y  e  5), 

yielding  the  bound 

7  <  max(/5(y)"^  :  2/ G  5). 

We  can  summarise  the  above  discussion  with  the  following  theorem. 

Theorem  1  Let  X  be  a  finite  state  aperiodic  irreducible  discrete-time  Markov 
chain.  Then, 

lim  -  log(!|P(Xt  e  •)  -  e  OH)  <  log(max/3(2/)-i), 

n-»oo  n  y^S 

where  0{y)  =  sup{  2  :  <  oo  }. 

This  result  bounds  the  rate  of  convergence  to  stationarity,  in  terms  of  the 
rate  at  which  the  chain  X  returns  to  the  various  states  of  S.  In  certain  settings, 
probability  arguments  can  then  be  used  to  a  priori  dominate  the  /3(y)’s. 

The  theorem  above  complements  the  many  other  results  that  have  been 
developed  in  recent  years  to  bound  the  rate  of  convergence  to  stationarity;  see, 
for  example,  Diaconis  and  Stroock  [3],  Fill  [4],  and  Meyn  and  Tweedie  [8]. 
In  certain  highly  structured  models,  these  analytic  tools  turn  out  to  be  quite 
powerful,  and  the  bounds  obtained  are  relatively  tight.  However,  in  general, 
it  is  probably  fair  to  say  that  for  unstructured  systems,  the  bounds  are  often 
quite  loose  and  consequently  of  less  practical  value.  In  addition,  a  glance  at  (1) 
makes  clear  that  a  bound  on  7  does  not  necessarily  provide  a  bound  on  the  total 
variation  distance  between  P(Xf  G  •)  and  P7r(Af  G  •)  (although  some  analytical 
tools  give  a  bound  also  on  the  total  variation  distance). 

Another  criticism  of  the  above  approach  is  that  for  unbounded  functions 
/  (as  often  arise  in  engineering  applications),  a  bound  on  the  total  variation 
distance  does  not  translate  into  a  bound  on  \Ef{X{t))  —  E7r/(X(0))|,  and  hence 
the  information  obtained  about  the  bias  of  a{t)  is  somewhat  limited. 


3  A  Data-Driven  Stationarity  Detection  Rule 

In  section  2,  our  concern  was  with  describing  upper  bounds  on  the  total  variation 
rate  of  convergence  that  takes  explicit  account  of  the  specific  transition  structure 
of  the  process.  However,  historically,  it  is  fair  to  say  that  the  most  widely  used 
methods  for  determining  the  time  to  stationarity  have  made  such  assessments 
based  purely  on  the  observed  data  obtained  by  simulating  the  system  itself;  see, 
for  example,  Conway  [2]  and  Gafarian,  Ancker,  and  Morisaku  [5].  A  principal 
difficulty  with  this  approach  is  that  very  few  such  data-driven  methods  have 
come  equipped  with  any  theoretical  guarantees;  see,  however,  Asmussen,  Glynn, 
and  Thorisson  [1]  for  some  noteworthy  exceptions. 
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Here,  we  analyze  a  data-driven  rule  first  proposed  in  Glynn  and  Iglehart  [6], 
and  show  that  it  enjoys  some  theoretically  important  properties.  Our  goal  is  to 
define  a  non-negative  family  of  random  variables  ( T{t)  :t>0)  such  that: 

a  T{t)  <  t  a.s., 

(4) 

b  P(T(t)  e  -IX)  =  P(T(f)  €  •|X(s)  :  0  <  s  <  f)  for  f  >  0, 

c  ||P(X7’(i)  6  •)  -  P7r(-^  6  OH  -♦  0  as  t  — >  00, 

d  ( T{t)  :  t  >  0 )  is  a  tight  family  of  r.v.’s  (under  P). 

For  a  simulation  time  horizon  t,  we  then  view  T{t)  as  the  epoch  at  which 
the  process  X  is  in  approximate  stationarity.  Condition  b  above  states  that 
T{t)  can  be  generated  once  X  has  been  simulated  to  time  f,  whereas  a  forces 
T{t)  to  be  in  the  interval  [0,t].  Condition  c  asserts  that  X  is  in  approximate 
stationarity  at  time  T(t),  whereas  d  rules  out  detection  rules  that  throw  out 
more  and  more  data  as  t  — >  oo  (e.g.  T{t)  = 

The  rule  proposed  in  Glynn  and  Iglehart  [6]  is  described  by  the  following 
algorithm: 


1.  Simulate  X  to  time  t. 

2.  Generate  a  uniform  r.v.  U,  independent  of  X. 

3.  Set  T{t)  =  inf{  s  >  0  :  X{s)  =  X{Ut)  }. 

Clearly  T{t)  <  t  always  holds.  Set  Z{t)  =  X{Ut)  and  note  that 

P(Z(f)  6  -IJf)  =  7r,(.) 

where  Ttt{-)  is  the  empirical  distribution  of  X  defined  by 

M-)  =  \  f^I{X{s)e-)ds. 

Hence,  (4)b  also  holds.  To  establish  c,  we  need  to  restrict  the  class  of 
processes  X  under  consideration.  Specifically,  suppose  that  X  is  an  irreducible, 
positive  recurrent,  continuous-time  Markov  chain  living  on  a  finite  or  countably 
infinite  state  space.  Then,  the  law  of  large  numbers  for  such  processes  guarantees 
that  for  each  x  e  S, 

1 

7rt(x)  =  7  /  I{X (s)  =  x)ds 7r{x)  a.,s. 

*  Jo 
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as  t  00,  where  tt  is  the  (unique)  stationary  distribution  of  X.  Since  S  is 
discrete,  it  then  easily  follows  that 


Ikt  -  7r||  — ►  0  a.s. 


(5) 


as  t  ^  00. 

We  now  use  a  coupling  argument  to  complete  the  proof  of  c.  For  each  t  >  0, 
let  Z{t)  be  an  5-valued  r.v.  having  conditional  distribution  given  by 


F{Z{t)=x\X,  U)  = 


[■7r{x)  -  7rf(x)]+ 


where  [y]"''  ^  y  V  0  for  y  €  JR.  Let  U'  be  a  uniform  r.v.  independent  oi  x,U, 
and  Z{t)  and  set 


z*(t)  =  z{t)iiu'  <  + mm'  > 


mit)) ' 
Mm)’ 


Observe  that 
P(Z*(0  =x\X)  = 


y^E(P(Z(t)=x|X,J7)[l-( 

y€S 


Hy) 

My) 


A  l)]I{Zit)  =  y)\X) 


(7r(x)  A  7r,(x))  +  nm  =  ^1^,  u)[My)  -  ^(y)]+ 

y€S 

7r(x). 


(We  have  used  the  easily  established  fact  that 

Y^My)  -  My)]'^  =  5Zk«(y)  -  7r(y)]+.) 

y  y 

Hence,  Z*{t)  is  a  r.v.  having  the  stationary  distribution  that  is  independent 
of  X.  Set  T*{t)  =  inf{s  >  0  :  X{s)  =  Z*{t)}.  Clearly,  the  aforementioned 
properties  of  Z*{t)  imply  that 


P(XT.(t)  €  •)  =  Px(^  €  •) 

for  t  >  0.  Since  T*{t)  =  T{t)  on  {Z{t)  =  Z*(t)  },  it  follows  that 
|P(XT(t)  e  B)  -  P„(X  e  B)| 

=  |P(XT(t)  €B)-  P(XT.(t)  €  B)l 

=  |P(XT(t)  €  B,  Z(t)  =  Z*(t))  +  P(Xr(t)  €  B,  Z(t)  ^  Z*(t)) 

-  P(XT.(t)  €  B,  Z(t)  =  Z-(t))  -  PiXr^M  €  B,Z(t)  ^  Z*(t))| 

<  \P{XTit)  €  B,  Z{t)  Z*(t))  -  P(Xt.(,)  €  B,  Z{t)  ^  Z*(t))| 

<  P(Z(t)  ^  Z*(t)). 
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We  have  therefore  established  the  following  coupling  inequality: 
l|P(^T(t)  €  •)  -  P,r(X  €  Oil  <  P{Z{t)  ^  Z*(f)). 

But 

PiZ{t)  ^  r{t)\x)  <  P(u>>li^ix)  (6) 

7rt(Z(t)) 

y 

On  the  other  hand,  the  latter  sum  is  just  ||7r(  —  7r||,  which  (5)  asserts  goes  to 
zero  a.s.  The  Bounded  Convergence  Theorem,  applied  to  (6),  then  yields  the 
conclusion 

P(Z(t)  #  Z*(f)) 0 

ast-*  oo,  verifying  (4)c.  As  for  condition  (4)d,  observe  that  for  x  >  0, 

P(T(f)  >  x)  <  P(r(t)  >  X,  Z{t)  /  Z*{t))  +  P{Z{t)  ^  Z*{t)) 

<  P(r-(t)>x)  +  p(z(f)^z*(f)). 

But  r*(t)  has  a  distribution  independent  of  t,  and  is  finite- valued.  Further¬ 
more,  P{Z{t)  ^  Z*{t))  — ►  0  as  <  — >  oo,  establishing,  for  each  e  >  0,  existence  of 
X  =  x(e)  and  t(e)  such  that  P(r(t)  >  x)  <  e  for  t  >  t{e).  On  the  other  hand, 
over  [0,t(e)],  the  non-explosiveness  of  X  guarantees  that  there  exists  a  finite 
deterministic  set  C  S  such  that 

P(7rt(e)(5t(,))  =  1)  >  1  -  e. 

On  the  event  { =  1 }  (namely,  those  outcomes  for  which  X  spends  the 
entire  interval  [0,t(e)]  in  S't(,)),  T{u)  {0  <  u  <  t)  is  bounded  by  max{min{s  > 
0  :  X(s)  =  y  },  y  e  St(c)  },  which  is  a  finite  r.v.  independent  of  u.  Consequently, 
we  can  find  x'(e)  for  which  P(r(n)  >  x'(e))  <  2e  for  0  <  u  <  t{e).  This  proves 
the  required  tightness,  and  completes  the  proof  of  the  second  major  theorem  in 
this  paper. 

Theorem  2  If  X  is  an  irreducible  positive  recurrent  continuous-time  Markov 
chain  taking  values  in  a  finite  or  countably  infinite  state  space,  then  the  algo¬ 
rithm  (l)-(3)  produces  a  family  of  r.v.  ’s  {T{t)  :  t  >  0)  having  properties  (4)a 
-  d. 

Note  that  for  a  given  model,  one  has  no  guarantee  that  a  specifically  chosen 
time  horizon  t  will  be  sufficiently  large  so  that  the  asymptotics  aissociated  with 
(4)c  are  in  force.  While  this  is  clearly  a  drawback,  the  same  drawback  is  shared 
by  (for  example)  most  applications  of  the  central  limit  theorem  in  a  statistical 
environment  (in  which  one  is  never  certain  as  to  whether  the  sample  size  is  suffi¬ 
ciently  large  so  as  to  guarantee  a  good  normal  approximation).  Mote,  however. 
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that,  as  in  section  2,  bounds  on  total  variation  distance  do  not  directly  translate 
into  bounds  on  bias. 

Nevertheless,  we  believe  that  the  algorithm  (l)-(3)  has  sufficient  practical 
merit  so  as  to  be  worthy  of  further  investigation.  Additional  properties  of  this 
algorithm  will  be  described  in  a  forthcoming  paper;  see  Glynn  [7]. 
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